Current Issue: Volume 2015, Issue 4 (October–December), 4 articles
Deep neural network (DNN)-based approaches have been shown to be effective in many automatic speech recognition systems. However, few works have focused on DNNs for distant-talking speaker recognition. In this study, a bottleneck feature derived from a DNN and a cepstral-domain denoising autoencoder (DAE)-based dereverberation are presented for distant-talking speaker identification, and a combination of these two approaches is proposed. For the DNN-based bottleneck feature, we noted that DNNs can transform the reverberant speech feature to a new feature space with greater discriminative classification ability for distant-talking speaker recognition. In contrast, cepstral-domain DAE-based dereverberation tries to suppress the reverberation by mapping the cepstrum of reverberant speech to that of clean speech, with the expectation of improving the performance of distant-talking speaker recognition. Since the DNN-based discriminative bottleneck feature and DAE-based dereverberation have a strongly complementary nature, the combination of these two methods is expected to be very effective for distant-talking speaker identification. A speaker identification experiment was performed on a distant-talking speech set, with reverberant environments differing from the training environments. In suppressing late reverberation, our method outperformed state-of-the-art dereverberation approaches such as multichannel least mean squares (MCLMS). Compared with MCLMS, we obtained relative error rate reductions of 21.4% for the bottleneck feature and 47.0% for the autoencoder feature. Moreover, combining the likelihoods of the DNN-based bottleneck feature and DAE-based dereverberation further improved the performance.
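As a rough illustration of the two components described in this abstract, the sketch below (not the authors' implementation) defines a speaker-classification DNN whose narrow bottleneck layer yields the discriminative feature, and a denoising autoencoder that regresses clean cepstra from reverberant cepstra. Layer sizes, the feature dimension, and the context window are illustrative assumptions, not values from the paper.

```python
# A minimal sketch, assuming spliced cepstral input frames; not the paper's
# configuration.
import torch
import torch.nn as nn

FEAT_DIM = 39 * 11  # assumed: 39-d cepstral frames with +/-5 context frames

class BottleneckDNN(nn.Module):
    """Speaker classifier; the bottleneck activation is used as the feature."""
    def __init__(self, n_speakers: int, bottleneck: int = 64):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(FEAT_DIM, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, bottleneck),      # narrow bottleneck layer
        )
        self.back = nn.Sequential(nn.Sigmoid(), nn.Linear(bottleneck, n_speakers))

    def forward(self, x):
        bn = self.front(x)                    # extracted as the feature at test time
        return self.back(bn), bn

class CepstralDAE(nn.Module):
    """Maps a reverberant cepstral frame (with context) to an estimated clean frame."""
    def __init__(self, out_dim: int = 39):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 512), nn.Sigmoid(),
            nn.Linear(512, 512), nn.Sigmoid(),
            nn.Linear(512, out_dim),          # linear output for regression
        )

    def forward(self, reverberant):
        return self.net(reverberant)

# Training (not shown): cross-entropy on speaker labels for BottleneckDNN, and
# MSE against parallel clean-speech cepstra for CepstralDAE. At identification
# time the likelihoods of the two resulting systems can be combined, e.g. by a
# weighted sum, which is the combination the abstract reports.
```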
Estimating the directions of arrival (DOAs) of multiple simultaneous mobile sound sources is an important step for\nvarious audio signal processing applications. In this contribution, we present an approach that improves upon our\nprevious work that is now able to estimate the DOAs of multiple mobile speech sources, while being light in\nresources, both hardware-wise (only using three microphones) and software-wise. This approach takes advantage of\nthe fact that simultaneous speech sources do not completely overlap each other. To evaluate the performance of this\napproach, a multi-DOA estimation evaluation system was developed based on a corpus collected from different\nacoustic scenarios named Acoustic Interactions for Robot Audition (AIRA)....
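One common building block for this kind of lightweight multi-microphone DOA estimation is time-delay estimation with GCC-PHAT on a microphone pair, converted to an angle under a far-field model. The hedged sketch below shows only that block; the microphone spacing and sampling rate are assumed values, and the paper's actual three-microphone method and its exploitation of non-overlapping simultaneous speech are not reproduced here.

```python
# GCC-PHAT time-delay estimation for one mic pair; illustrative constants.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_DISTANCE = 0.2       # m, assumed spacing of the pair
FS = 16000               # Hz, assumed sampling rate

def gcc_phat(sig, ref, fs=FS, max_tau=MIC_DISTANCE / SPEED_OF_SOUND):
    """Return the estimated time delay (s) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    X = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    X /= np.abs(X) + 1e-12                    # PHAT weighting: keep phase only
    cc = np.fft.irfft(X, n=n)
    max_shift = int(fs * max_tau)             # only physically possible lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def doa_from_tdoa(tau):
    """Far-field angle of arrival (degrees) from the broadside of the pair."""
    sin_theta = np.clip(tau * SPEED_OF_SOUND / MIC_DISTANCE, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```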
Optimal automatic speech recognition (ASR) takes place when the recognition system is tested under circumstances identical to those in which it was trained. In the real world, however, there are many sources of mismatch between the training and testing environments, notably the noise present in real environments. Speech enhancement techniques have been developed to provide ASR systems with robustness against such noise. In this work, a method based on histogram equalization (HEQ) is proposed to compensate for nonlinear distortions in the speech representation. This approach utilizes simultaneous stereo recordings of clean speech and its corresponding noisy speech to compute a stereo Gaussian mixture model (GMM). The stereo GMM is used to compute the cumulative density function (CDF) for both clean and noisy speech using a sigmoid function, instead of the order statistics used in other HEQ-based methods. In the implementation, we show two ways of applying HEQ: hard-decision HEQ and soft-decision HEQ, the latter based on minimum mean square error (MMSE) clean speech estimation. The experimental work shows that soft HEQ and hard HEQ achieve better recognition results than other HEQ approaches such as tabular HEQ, quantile HEQ, and polynomial-fit HEQ, and that soft HEQ achieves notably better recognition results than hard HEQ. The results also show that HEQ improves the efficiency of other speech enhancement techniques such as stereo piece-wise linear compensation for environment (SPLICE) and vector Taylor series (VTS), and that using HEQ in multi-style training (MST) significantly improves ASR system performance.
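The core HEQ mapping, transforming a noisy feature value through the noisy-domain CDF and back through the inverse clean-domain CDF so the equalized features follow the clean statistics, can be sketched as below. Following the abstract, each CDF is approximated with sigmoid (logistic) functions derived from a per-dimension GMM; the parameters, grid, and single-dimension setting are illustrative assumptions, and the MMSE-based soft decision is not shown.

```python
# Hard-decision HEQ sketch with sigmoid-approximated GMM CDFs; assumed values.
import numpy as np

def sigmoid_cdf(x, means, stds, weights):
    """GMM CDF with each Gaussian CDF approximated by a logistic sigmoid."""
    x = np.asarray(x, dtype=float)
    # logistic scale s = sigma * sqrt(3) / pi ~= 0.5513 * sigma matches the
    # Gaussian variance, giving a close CDF approximation
    z = (x[..., None] - means) / (0.5513 * stds)
    return np.sum(weights / (1.0 + np.exp(-z)), axis=-1)

def heq_hard(x_noisy, noisy_gmm, clean_gmm, grid=np.linspace(-50.0, 50.0, 4001)):
    """Hard-decision HEQ: x_clean = C_clean^{-1}(C_noisy(x))."""
    p = sigmoid_cdf(x_noisy, *noisy_gmm)        # rank of x in the noisy domain
    clean_cdf = sigmoid_cdf(grid, *clean_gmm)   # tabulated clean-domain CDF
    return np.interp(p, clean_cdf, grid)        # numerical inverse CDF

# Example with single-component "GMMs" (plain Gaussians), purely illustrative:
noisy_gmm = (np.array([5.0]), np.array([3.0]), np.array([1.0]))
clean_gmm = (np.array([0.0]), np.array([1.0]), np.array([1.0]))
print(heq_hard(np.array([2.0, 5.0, 8.0]), noisy_gmm, clean_gmm))
# -> values near [-1, 0, 1]: the noisy mean 5 maps to the clean mean 0
```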
This paper presents an objective speech quality model, ViSQOL, the Virtual Speech Quality Objective Listener. It is a signal-based, full-reference, intrusive metric that models human speech quality perception using a spectro-temporal measure of similarity between a reference and a test speech signal. The metric has been designed in particular to be robust to quality issues associated with Voice over IP (VoIP) transmission. This paper describes the algorithm and compares its quality predictions with the ITU-T standard metrics PESQ and POLQA for common problems in VoIP: clock drift, associated time warping, and playout delays. The results indicate that ViSQOL and POLQA significantly outperform PESQ, with ViSQOL competing well with POLQA. An extensive benchmarking against PESQ, POLQA, and simpler distance metrics using three speech corpora (NOIZEUS, E4, and the ITU-T P.Sup. 23 database) is also presented. These experiments benchmark the performance for a wide range of quality impairments, including VoIP degradations, a variety of background noise types, speech enhancement methods, and SNR levels. The results and subsequent analysis show that both ViSQOL and POLQA have some performance weaknesses and under-predict perceived quality in certain VoIP conditions, but both have wider applicability and robustness to varied conditions than PESQ or more trivial distance metrics. ViSQOL is shown to offer a useful alternative to POLQA in predicting speech quality in VoIP scenarios.
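As a toy illustration of the full-reference, spectro-temporal similarity idea, not the ViSQOL algorithm itself (ViSQOL uses a gammatone spectrogram, patch alignment, the NSIM measure, and a mapping to MOS), the sketch below scores a degraded signal against a time-aligned reference with an SSIM-style measure over log-spectrogram patches; all window sizes and constants are assumptions.

```python
# SSIM-style spectro-temporal similarity toy; not the ViSQOL implementation.
import numpy as np
from scipy.signal import spectrogram

def log_spec(x, fs=16000):
    _, _, S = spectrogram(x, fs=fs, nperseg=512, noverlap=256)
    # shift so values are nonnegative, keeping the SSIM-style terms well behaved
    return np.log(S + 1e-10) - np.log(1e-10)

def patch_similarity(ref_patch, deg_patch, c=0.01):
    """SSIM-style similarity between two equally sized spectrogram patches."""
    mu_r, mu_d = ref_patch.mean(), deg_patch.mean()
    var_r, var_d = ref_patch.var(), deg_patch.var()
    cov = np.mean((ref_patch - mu_r) * (deg_patch - mu_d))
    return ((2 * mu_r * mu_d + c) * (2 * cov + c)) / (
        (mu_r ** 2 + mu_d ** 2 + c) * (var_r + var_d + c))

def spectro_temporal_score(ref, deg, fs=16000, patch_frames=30):
    """Mean patch similarity; 1.0 means the degraded signal matches the reference."""
    R, D = log_spec(ref, fs), log_spec(deg, fs)
    n = min(R.shape[1], D.shape[1])
    scores = [patch_similarity(R[:, i:i + patch_frames], D[:, i:i + patch_frames])
              for i in range(0, n - patch_frames + 1, patch_frames)]
    return float(np.mean(scores))
```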